Objectives:
1.To understand the context of the data.
2.To understand if any information can be
extracted from the data.
3.To analyse how this information(if any)
can be used to draw meaningful bussiness conclusions
by using machine learning models.
import pandas as pd
import numpy as np
import io
from io import StringIO
oyo_df = pd.read_csv('SampleCaseStudyData.csv', index_col=[0])
oyo_df.shape
(119390, 22)
#Shape of the data
oyo_df.shape
(119390, 22)
1. Data has 119390 entries with 22 columns.
oyo_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 119390 entries, 0 to 119389 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 type 119390 non-null object 1 canceledFlag 119390 non-null int64 2 time2Checkin 119390 non-null int64 3 arrivalMonth 119390 non-null object 4 arrivalWeek 119390 non-null int64 5 arrivalDay 119390 non-null int64 6 numberWeekendnights 119390 non-null int64 7 numberNights 119390 non-null int64 8 adults 119390 non-null int64 9 chidren 119386 non-null float64 10 country 118902 non-null object 11 segment 119390 non-null object 12 repeatFlag 119390 non-null int64 13 historicCancellations 119390 non-null int64 14 historicBookings 119390 non-null int64 15 roomType 119390 non-null object 16 assignedType 119390 non-null object 17 changesFlag 119390 non-null int64 18 deposit 119390 non-null object 19 waitingDays 119390 non-null int64 20 customerSegment 119390 non-null object 21 numberofRequests 119390 non-null int64 dtypes: float64(1), int64(13), object(8) memory usage: 21.0+ MB
print(oyo_df.type.describe())
print(oyo_df.type.value_counts())
count 119390 unique 2 top C freq 79330 Name: type, dtype: object C 79330 R 40060 Name: type, dtype: int64
import plotly.express as px
import numpy
random_x = list(oyo_df.type.value_counts())
names = ['C','R']
fig = px.pie(values=random_x, names=names)
fig.show()
1. Column 'type' doesn't have any Null or Empty values.
2. It contains two uniques categories 'C' and 'R'.
3. It is assumed that this column represents the type of hotel. Say Resort Hotel (R) and Common Hotel(C).
4. 66.4% of bookings are attempted for 'C' type hotel and 33.6% for 'R' type hotels.
print(oyo_df.canceledFlag.describe())
print(oyo_df.canceledFlag.value_counts())
count 119390.000000 mean 0.370416 std 0.482918 min 0.000000 25% 0.000000 50% 0.000000 75% 1.000000 max 1.000000 Name: canceledFlag, dtype: float64 0 75166 1 44224 Name: canceledFlag, dtype: int64
random_x = list(oyo_df.canceledFlag.value_counts())
names = ['0','1']
fig = px.pie(values=random_x, names=names)
fig.show()
1. Column 'canceledFlag' doesn't have any Null or Empty values.
2. It contains two uniques categories '0' and '1'.
3. It is assumed that this column captures information about the hotel booking is canceled or not. Value of '0' represents no cancellation and '1' represents cancellation.
4. 63% of bookings are not cancelled and 37% are cancelled.
oyo_df.time2Checkin.describe()
count 119390.000000 mean 104.011416 std 106.863097 min 0.000000 25% 18.000000 50% 69.000000 75% 160.000000 max 737.000000 Name: time2Checkin, dtype: float64
fig = px.box(oyo_df, y="time2Checkin")
fig.show()
# Let us create a new dummy column called 'arrivalYear' to use for plot violin plot.
oyo_df.insert(2, "arrivalYear", [2018]*119390, True)
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Violin(x=oyo_df['arrivalYear'][ oyo_df['canceledFlag'] == 0 ],
y=oyo_df['time2Checkin'][ oyo_df['canceledFlag'] == 0 ],
legendgroup='Yes', scalegroup=0, name='Not Canceled',
side='positive',
line_color='green')
)
fig.add_trace(go.Violin(x=oyo_df['arrivalYear'][ oyo_df['canceledFlag'] == 1 ],
y=oyo_df['time2Checkin'][ oyo_df['canceledFlag'] == 1 ],
legendgroup=1, scalegroup=0, name='Canceled',
side='negative',
line_color='red')
)
fig.update_traces(meanline_visible=True)
fig.update_layout(violingap=0, violinmode='overlay')
fig.show()
1. There is a clear difference in mean of the both the distributions of 'canceled' and 'Not Canceled' distributions
2. Mean time2checkin of canceled booking is accumulated at 144.
3. Mean time2checkin of 'Not canceled' booking is accumulated at 79.9.
4. This difference in mean can prove to have good predictive power to predict booking cancelation.
dfg = oyo_df.groupby('arrivalMonth').count().reset_index()
fig = px.bar(dfg,
x='arrivalMonth',
y='canceledFlag',
category_orders={"arrivalMonth": ["January", "February", "March", "April","May","June","July","August","September","October","November","December"]})
# plot
fig.show()
1. There is a clear pattern of bookings peaking in summer season.
2. Booking are low in November, December & January.
3. And hotel booking steadyly increased from January till August.
fig = px.box(oyo_df, x="arrivalMonth", y="time2Checkin", color="canceledFlag")
fig.show()
1. Difference between the mean of 'cancelled' and 'Not cancelled' booking is clear and persistent across all the months of year.
# Plotting the bookings by arrival week.
dfg = oyo_df.groupby('arrivalWeek').count().reset_index()
fig = px.bar(dfg,
x='arrivalWeek',
y='canceledFlag',
)
# plot
fig.show()
1. No of bookings ploted by arrivalWeek is inline when plotted by arrivalMonth.
# Plotting the bookings by arrivalDay.
dfg = oyo_df.groupby('arrivalDay').count().reset_index()
fig = px.bar(dfg,
x='arrivalDay',
y='canceledFlag',
)
# plot
fig.show()
1. There is no visual pattern observed for the corelation between no of booking and arrivalDay
fig = px.box(oyo_df, x='canceledFlag', y='numberWeekendnights', points="all")
fig.show()

fig1 = px.box(oyo_df, x='canceledFlag', y='numberNights', points="all")
fig1.show()

1. Distribution of numberWeekendnights & numberNights seems to similar in both 'canceled' and 'Not canceled' booking.
fig1 = px.box(oyo_df, x='canceledFlag', y='adults')
fig1.show()
fig1 = px.box(oyo_df, x='canceledFlag', y='chidren')
fig1.show()
1. Distribution of both adults and children seem to be similar between 'Canceled' and 'Non Canceled' bookings.
# Plot the no of booking by country
dfg_con = oyo_df.groupby('country').count().reset_index()
fig = px.scatter_geo(dfg_con, locations="country",
color="country",
hover_name="country", size="canceledFlag",
projection="natural earth")
fig.show()
1. In the dataset most of the booking data is related to european countries.
2. Booking are lead by Portugal with ~48K booking and followed by England, France and Spain.
oyo_df.segment.value_counts()
onl 56477 off 24219 gro 19811 dir 12606 cor 5295 com 743 avi 237 und 2 Name: segment, dtype: int64
dfg = oyo_df.groupby('segment').count().reset_index()
fig = px.bar(dfg,
x='segment',
y='canceledFlag'
)
fig.show()
1. It is assumed that 'segment' represents the customer segments.
2. It assumed that 'onl' represents online bookings. This segement dominates with 56.4K bookings.
3. It is followed by 'off'(assumed offline ), 'gro' (assumed group) and 'dir' (assumed direct) customer segments.
dfg_rep = oyo_df.groupby('repeatFlag').count().reset_index()
fig = px.bar(dfg_rep, x="repeatFlag", y="canceledFlag")
fig.show()
1. It is assumed that 'repeatFlag'column captures the information about if the booking is from a repeat customer or not.
2. It is assumed '1' represents repeated customer and '0' represents new customer.
3. There is huge skew between repeat and new customrs.
dfg_rep = oyo_df.groupby('historicCancellations').count().reset_index()
fig = px.bar(dfg_rep, x="historicCancellations", y="canceledFlag")
fig.show()
1. It is assumed that 'historicCancellations' column captures the inforamtion about the cancellation history of the customer.
2. Booking with no history of cancellation dominates the segment.
3. This is followed bookings from customers with one historic cancellation.
dfg_rep = oyo_df.groupby('historicBookings').count().reset_index()
fig = px.bar(dfg_rep, x="historicBookings", y="canceledFlag")
fig.show()
1. It is assumed that 'historicBookings' column captures the information on no of booking done by customer in history.
2. This column can be used to feature engineer to mark if the booking is from an repeated customer or new customer.
3. However, the data revealed that ~97% of booking are from new customers.
dfg_rep = oyo_df.groupby('roomType').count().reset_index()
fig = px.bar(dfg_rep, x="roomType", y="canceledFlag")
fig.show()
1. It is assumed that 'roomType' column represents the type of room.
2. Type 'A' dominates the distribution with ~86K bookings.
3. Type 'B' and 'E' follow it with ~19K and 6K bookings respectively.
dfg_rep = oyo_df.groupby('assignedType').count().reset_index()
fig = px.bar(dfg_rep, x="assignedType", y="canceledFlag")
fig.show()
1. It is assumed that 'assignedType' column represents the assigned catogery of room to customer. Sometimes the assigned room type can be different from booking room type. this could be because of overbooking.
2. Comparing the above plot with 'roomType' plot, it can be infered that.
a. roomType 'A' is mostly booked room type.
b. When roomType 'A' is overbooked customer gets assigned with roomType 'D'.
dfg_rep = oyo_df.groupby('changesFlag').count().reset_index()
fig = px.bar(dfg_rep, x="changesFlag", y="canceledFlag")
fig.show()
1. It is assumed that 'changesFlag' column caputures information and any changes made to booking.
2. About ~85% percentage of time people not make any changes to the booking.
dfg_rep = oyo_df.groupby('deposit').count().reset_index()
fig = px.bar(dfg_rep, x="deposit", y="canceledFlag")
fig.show()
# Find the values counts for distribution of waitingDays for all the booking.
oyo_df.waitingDays.sort_values(ascending=True).value_counts()
0 115692
39 227
58 164
44 141
31 127
...
175 1
117 1
89 1
92 1
183 1
Name: waitingDays, Length: 128, dtype: int64
dfg_rep = oyo_df.groupby('waitingDays').count().reset_index()
fig = px.bar(dfg_rep, x="waitingDays", y="canceledFlag")
fig.show()
1. It is assumed that 'waitingDays' column captures the information about no of days the booking was kept under waiting before confirmation.
2. About ~97% of the times booking was confirmed on the same day.
dfg_rep = oyo_df.groupby('customerSegment').count().reset_index()
fig = px.bar(dfg_rep, x="customerSegment", y="canceledFlag")
fig.show()
1. It is assumed that 'customerSegment' column captures the information about cutomer segment. Couldn't infer the meaning of 'C', 'G' & 'T'.
2. Segment 'T' dominates the customer segment with ~96% of bookings.
3. This is followed by 'C' & 'G'
dfg_rep = oyo_df.groupby('numberofRequests').count().reset_index()
fig = px.bar(dfg_rep, x="numberofRequests", y="canceledFlag")
fig.show()
1. It is assumed that 'numberofRequests' column captures information about if there are any additional requests made by customer.
2. Around 59% of time customers do not request for any addtional changes.
With the assumptions made above let us prepare a data dictionary.
##### type : Type of hotel.
##### canceledFlag : Is the booking canceled.
##### time2Checkin : Time between booking and checkin.
##### arrivalMonth : Checkin month while booking.
##### arrivalWeek : Checkin week while booking.
##### arrivalDay : Checkin day while booking.
##### numberWeekendnights : No of weekend nights booked for.
##### numberNights : No of nights booked.
##### adults : No of adults included for occupancy.
##### chidren : No of children included for occupancy.
##### country : Country that hotel beloings to.
##### segment : Customer booking segment.
##### repeatFlag : Is the customer exsisting one or new.
##### historicCancellations : How many times the customer had cancelled the booking historically.
##### historicBookings : How many times the customer did booking previously.
##### roomType : What type of room is booked.
##### assignedType : What type of room is assigned.
##### changesFlag : No of changes request after or while booking.
##### deposit : Deposit type.
##### waitingDays : No of days the booking was kept in waiting state before confirmation.
##### customerSegment : Which segment out of 'C','T' & 'G' does customer belong to.
##### numberofRequests : No of booking done by customer previously.
drp_df = oyo_df.drop(['arrivalYear'], axis = 1)
corr = drp_df.corr()
fig = px.imshow(corr)
fig.update_layout( width=800,
height=800)
fig.show()
1. 'canceledFlag' column show decent corelation with 'time2Checkin' & 'historicCancellation'